Add DeepSeek V3.2 support #3270
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new Sparse Flash Attention (SFA) mechanism, primarily for DeepSeek V2 models on Ascend hardware. The changes are extensive, adding new attention backends, modifying model definitions, and updating distributed communication logic to support the new KV cache structure for SFA. While the implementation adds significant new functionality, I have identified a critical issue in multi-stream metadata handling that could lead to incorrect attention calculations. Additionally, there are several instances of hardcoded configuration values that make the implementation brittle and prone to errors with different model variants. I've also flagged a few `TODO` comments in critical code paths that suggest the logic may be incomplete or unverified. Addressing these points is crucial for the stability and correctness of the new SFA feature.
return model_input_split_v1_mla_attn(
    ms_split_config=ms_split_config,
    attn_metadata=self,
    _metadata_cls=AscendMLAMetadata,
)
The `split_metadata_for_multistream` method is incorrectly using `AscendMLAMetadata` as the `_metadata_cls` when splitting metadata. Since this method is part of `AscendSFAMetadata`, it should use its own class, `AscendSFAMetadata`, to ensure the correct metadata structure is propagated in multi-stream scenarios. Using the wrong metadata class can lead to incorrect attention calculations or runtime errors.
Suggested change:

  return model_input_split_v1_mla_attn(
      ms_split_config=ms_split_config,
      attn_metadata=self,
-     _metadata_cls=AscendMLAMetadata,
+     _metadata_cls=AscendSFAMetadata,
  )
prefill_q_pe = torch_npu.npu_interleave_rope(
    prefill_q_pe, cos_q, sin_q)  # BNSD
prefill_q_pe = prefill_q_pe.squeeze(2)  # BSH
# q[..., self.qk_nope_head_dim:] = prefill_q_pe  # TODO:????
This line contains a `TODO:????` comment, which indicates uncertainty about the implementation. This part of the code seems related to applying rotary position embeddings. Leaving such a TODO suggests the logic might be incomplete or incorrect, which could lead to erroneous attention results. This should be investigated and resolved before merging.
def mla_epilog(self,
               attn_output: torch.Tensor = None,
               absorb: bool = False):
    # TODO: need to check
A `TODO: need to check` comment is present in the `mla_epilog` function. This indicates that the logic in this function may not have been fully verified. Unchecked code, especially in critical paths like the attention epilogue, poses a risk and should be reviewed and confirmed to be correct before merging.
# TODO(zzzzwwjj): wait transformers add these params
self.n_heads: int = 64  # 64
self.head_dim: int = 128  # 128
self.index_topk: int = 2048  # 2048
The parameters `n_heads`, `head_dim`, and `index_topk` for the indexer are hardcoded. This makes the implementation brittle and may cause silent errors or incorrect behavior if used with model variants that have different configurations. These values should be derived from the model `config` object rather than being hardcoded.
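One possible shape for that, shown only as a hedged sketch: read the dimensions from the HF config and keep the current numbers as fallbacks. The attribute names (`index_n_heads`, `index_head_dim`, `index_topk`) are assumptions, since transformers has not exposed these fields yet.

```python
# Sketch only: derive the indexer dimensions from the model config instead of
# hardcoding them. The attribute names below are hypothetical placeholders for
# whatever fields transformers eventually adds for the DeepSeek V3.2 indexer.
def resolve_indexer_dims(config):
    n_heads = getattr(config, "index_n_heads", 64)      # fallback: today's value
    head_dim = getattr(config, "index_head_dim", 128)   # fallback: today's value
    index_topk = getattr(config, "index_topk", 2048)    # fallback: today's value
    return n_heads, head_dim, index_topk
```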
def __init__(self,
             config,
             dim: int = 7168,
             n_heads: int = 64,
             head_dim: int = 128,
             index_topk: int = 2048,
             q_lora_rank: int = 1536,
             rope_head_dim: int = 64,
             quant_config: Optional[QuantizationConfig] = None,
             prefix: Optional[str] = ""):
The `Indexer` class `__init__` method uses hardcoded default values for several dimensional parameters such as `dim`, `n_heads`, `head_dim`, etc. This is risky as it might not match the actual model configuration passed during instantiation, leading to silent errors or incorrect behavior. These parameters should be passed explicitly from the model's configuration during initialization, and the defaults should be used cautiously.
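For illustration, a hedged sketch of constructing the class from config-derived values rather than the defaults; `resolve_indexer_dims` is the hypothetical helper from the earlier sketch, and the surrounding names (`quant_config`, `prefix`) are assumed to be in scope at the call site.

```python
# Sketch only: pass resolved config values explicitly instead of relying on the
# hardcoded defaults in Indexer.__init__.
n_heads, head_dim, index_topk = resolve_indexer_dims(config)
self.indexer = Indexer(
    config,
    dim=config.hidden_size,
    n_heads=n_heads,
    head_dim=head_dim,
    index_topk=index_topk,
    q_lora_rank=config.q_lora_rank,
    quant_config=quant_config,
    prefix=f"{prefix}.indexer",
)
```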
sfa_bytes = 128 * self.block_size * get_dtype_size(
    self.dtype) if self.use_sfa else 0
The value `128` is hardcoded in the `sfa_bytes` calculation. This value appears to correspond to the indexer's head dimension. Using magic numbers reduces code maintainability and can lead to bugs if the indexer's configuration changes. This value should be retrieved from a configuration source or defined as a named constant to improve clarity and robustness.
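A minimal sketch of the named-constant variant, dropping into the existing method; the constant and its value mirror the current hardcoded number, and whether it should instead come from the model config is left open.

```python
# Sketch only: give the magic number a name so the cache-size math stays tied
# to the indexer's head dimension in one obvious place.
SFA_INDEXER_HEAD_DIM = 128  # assumption: equals the indexer head_dim used above

sfa_bytes = (SFA_INDEXER_HEAD_DIM * self.block_size *
             get_dtype_size(self.dtype)) if self.use_sfa else 0
```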
self.dim: int = config.hidden_size  # 7168
# TODO(zzzzwwjj): wait transformers add these params
self.n_heads: int = 64  # 64
self.head_dim: int = 128  # 128
self.index_topk: int = 2048  # 2048
The parameters `dim`, `n_heads`, `head_dim`, and `index_topk` for the indexer are hardcoded. This makes the implementation brittle and may cause issues if used with model variants that have different configurations. These values should be derived from the model `config` object instead of being hardcoded to ensure the implementation is robust.
Code Review
This pull request introduces a new Sparse Flash Attention (SFA) backend for Ascend NPUs, specifically for the DeepSeek-V2 model. The changes include the core SFA implementation, metadata builders, torchair integration, and patches to vLLM to select and configure this new backend. While the implementation is extensive, I've found several critical issues related to incorrect path selection between prefill and decode operations in mixed batches, which could lead to incorrect attention results. There are also some maintainability concerns with hardcoded values. Addressing these issues is crucial for the correctness and robustness of the new backend.
def apply_attention_fusion(self, query_states, key_states, topk_indices,
                           attn_metadata: M):
    # repeat k/v heads if n_kv_heads < n_heads
    q_nope, q_pe = query_states
    k_nope, k_rope = key_states

    if attn_metadata.prefill is not None:

        prefill_metadata = attn_metadata.prefill

        slc_fa_fusion = torch.ops.custom.npu_sparse_flash_attention(
            query=q_nope,
            key=k_nope,
            value=k_nope,
            sparse_indices=topk_indices,
            scale_value=self.scale,
            sparse_block_size=1,
            block_table=prefill_metadata.block_table,
            actual_seq_lengths_query=prefill_metadata.query_lens,
            actual_seq_lengths_kv=prefill_metadata.seq_lens,
            query_rope=q_pe,
            key_rope=k_rope,
            layout_query="TND",
            layout_kv="PA_BSND",
            sparse_mode=3,
        )

    elif attn_metadata.decode is not None:
The logic to distinguish between prefill and decode paths is incorrect. In a mixed batch with both prefill and decode requests, `attn_metadata.prefill` will be non-None. This will cause the `decode_attn_output` computation to incorrectly enter the prefill branch, leading to wrong attention results. The `elif` for the decode path will never be reached in a mixed batch.

To fix this, add a boolean flag, e.g. `is_prefill`, to the `apply_attention_fusion` function signature, then use this flag to select the correct code path:
def apply_attention_fusion(self, query_states, key_states, topk_indices,
                           attn_metadata: M, is_prefill: bool):
    # ...
    if is_prefill:
        prefill_metadata = attn_metadata.prefill
        assert prefill_metadata is not None
        # ... prefill logic
    else:  # is_decode
        decode_metadata = attn_metadata.decode
        assert decode_metadata is not None
        # ... decode logic
You will also need to update the call sites in the `forward` method to pass this flag (a rough sketch follows this list):

- `is_prefill=False` for the decode call (line 827).
- `is_prefill=True` for the prefill call (line 835).
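A hedged sketch of what those two call sites might look like; the surrounding variable names are invented for illustration, and the line numbers refer to the reviewed diff.

```python
# Illustrative only: thread the new flag through both call sites in forward().
# Variable names here are placeholders, not the PR's actual identifiers.
decode_attn_output = self.apply_attention_fusion(
    decode_query_states, decode_key_states, decode_topk_indices,
    attn_metadata, is_prefill=False)   # decode call (around line 827)

prefill_attn_output = self.apply_attention_fusion(
    prefill_query_states, prefill_key_states, prefill_topk_indices,
    attn_metadata, is_prefill=True)    # prefill call (around line 835)
```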
def indexer_select(
    self,
    x: torch.Tensor,
    qr: torch.Tensor,
    kv_cache: Tuple[torch.Tensor, torch.Tensor, torch.Tensor],
    attn_metadata: M,
):
This function has the same issue as `apply_attention_fusion`. The `if attn_metadata.prefill is not None:` check is not sufficient to distinguish between prefill and decode paths in a mixed batch. This will cause incorrect metadata to be used for decode requests.

Please add an `is_prefill` flag to the function signature and use it to correctly select the metadata:
def indexer_select(
    self,
    x: torch.Tensor,
    qr: torch.Tensor,
    kv_cache: Tuple[torch.Tensor, torch.Tensor, torch.Tensor],
    attn_metadata: M,
    is_prefill: bool,
):
    if is_prefill:
        prefill_metadata = attn_metadata.prefill
        assert prefill_metadata is not None
        cos = prefill_metadata.cos
        # ...
    else:  # is_decode
        decode_metadata = attn_metadata.decode
        assert decode_metadata is not None
        cos = decode_metadata.cos
        # ...
The call sites in `_sfa_preprocess` should be updated to pass `is_prefill=False` for the decode path (line 690) and `is_prefill=True` for the prefill path (line 778).
# Profiling run.
return output

if attn_metadata.prefill is not None:
The condition `if attn_metadata.prefill is not None:` is incorrect for handling mixed batches. In a batch with both prefill and decode requests, this condition will always be true, and the decode path in the `elif` block (line 1099) will never be reached. This will cause decode requests to be processed with prefill logic, which is incorrect.

The `AscendSFATorchairMetadataBuilder` sets `is_prefill` and `is_decode` flags. Please use `if attn_metadata.is_prefill:` here, and `elif attn_metadata.is_decode:` for the decode block, to ensure correct path selection.
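A hedged sketch of the corrected branching; the helper names stand in for the existing prefill/decode bodies and are not the PR's actual identifiers.

```python
# Sketch only: branch on the flags set by AscendSFATorchairMetadataBuilder
# instead of testing whether the prefill metadata happens to be present.
if attn_metadata.is_prefill:
    output = self._run_prefill(query, kv_cache, attn_metadata)   # placeholder
elif attn_metadata.is_decode:
    output = self._run_decode(query, kv_cache, attn_metadata)    # placeholder
```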
):
    if attn_metadata.prefill is not None:
        cos = attn_metadata.prefill.cos
        sin = attn_metadata.prefill.sin
    elif attn_metadata.decode is not None:
        cos = attn_metadata.decode.cos
        sin = attn_metadata.decode.sin
The `is_prefill` parameter is passed to this function but is not used. Instead, the function uses `if attn_metadata.prefill is not None:`, which is incorrect for mixed batches and will lead to using prefill metadata for decode requests.

Please refactor this function to use the `is_prefill` flag to correctly select the metadata from either `attn_metadata.prefill` or `attn_metadata.decode`.
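A minimal sketch of that refactor, keeping the original attribute accesses and only switching the branch condition to the flag that is already passed in:

```python
# Sketch only: honour the is_prefill argument rather than re-deriving the
# branch from whichever metadata happens to be non-None.
if is_prefill:
    assert attn_metadata.prefill is not None
    cos = attn_metadata.prefill.cos
    sin = attn_metadata.prefill.sin
else:
    assert attn_metadata.decode is not None
    cos = attn_metadata.decode.cos
    sin = attn_metadata.decode.sin
```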
sfa_bytes = 128 * self.block_size * get_dtype_size(
    self.dtype) if self.use_sfa else 0
The value `128` is hardcoded for calculating `sfa_bytes`. This seems to correspond to the `head_dim` of the SFA indexer. Hardcoding this value makes the code less maintainable and harder to adapt to models with different indexer dimensions.

This value should be retrieved from the model configuration and passed to `AttentionSpec` instead of being hardcoded.
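A hedged sketch of that direction; the exact plumbing is an assumption, the attribute name `index_head_dim` is hypothetical, and `hf_config` is assumed to be reachable wherever this cache-size math lives.

```python
# Sketch only: look the indexer head dim up once instead of repeating the
# literal 128. The config attribute name here is a placeholder.
indexer_head_dim = getattr(hf_config, "index_head_dim", 128)
sfa_bytes = (indexer_head_dim * self.block_size *
             get_dtype_size(self.dtype)) if self.use_sfa else 0
```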
Code Review
This pull request introduces support for a new Sparse Flash Attention (SFA) backend, primarily for DeepSeek V2 models on Ascend NPUs. The changes are extensive, adding new attention implementations for both eager mode (`sfa_v1.py`) and torch.compile/torchair mode (`torchair_sfa.py`), along with necessary modifications across the model definition, configuration, and distributed components. While the implementation is comprehensive, I've identified several critical and high-severity issues related to correctness and maintainability, including incorrect metadata class usage, type mismatches, hardcoded values, and significant code duplication. These issues should be addressed to ensure the stability and correctness of the new SFA backend. The pull request description is also missing, which would be very helpful for understanding the context and testing of these complex changes.
UT passed here: #3277
What this PR does / why we need it?
This PR adds initial DeepSeek V3.2 support on top of vLLM v0.11.0 (not released yet). We will complete the vLLM adaptation as soon as possible; this feature should be ready within the next 1-2 days.
Related doc: #3223.
Does this PR introduce any user-facing change?
Yes!
How was this patch tested?
CI passed, and we will run through the DeepSeek doc soon.